Abstract

We construct examples of contingency tables on n binary random variables where the gap 
between the linear programming lower/upper bound and the true integer lower/upper bounds 
on cell entries is exponentially large. These examples provide evidence that linear programming 
may not be an effective heuristic for detecting disclosures when releasing margins of multi-way 
tables. 

1 Introduction 

A fundamental problem in data security is to determine what information about individual survey
respondents can be inferred from the release of partial data. The particular instance of this
problem we are interested in concerns the release of margins of a multidimensional contingency
table. In particular, given a collection of margins of a multi-way table, can individual cell entries
in the table be inferred? This type of problem arises when statistical agencies like a census bureau
release summary data to the public, but are required by law to maintain the privacy of individual
respondents.

Many authors [1], [2], [3] have proposed that an individual cell entry is secure if, among all
contingency tables with the given fixed marginal totals, the upper bound and lower bound for the
cell entry are far enough apart. In general, solving the integer program associated with finding the
sharp integer upper and lower bounds on a cell entry is known to be NP-hard. A heuristic which has
been suggested for approximating these upper and lower bounds is to solve the appropriate linear
programming relaxation. Based on theoretical results for 2-way tables and practical experience for
some small multi-way tables, some authors have suggested that the linear programming bounds and
other heuristics should always constitute good approximations to the true bounds for cell values.

In this paper, we attempt to refute the claim that the linear programming bounds are, in 
general, good approximations to the true integer bounds. In particular, we will show the following: 

Theorem 1. There is a sequence of hierarchical models on $n$ binary random variables and a
collection of margins such that the gap between the linear programming lower (upper) bounds and the
integer programming lower (upper) bounds for a cell entry grows exponentially in $n$.

For instance, on 10 binary random variables, our construction produces an instance where this 
difference is more than 100. This constitutes a significant discrepancy between the heuristic and 
reality, in a problem of size which is quite small from the practical standpoint. 
The outline of this paper is as follows. In the next section we review hierarchical models and the
algebraic techniques that we will use to construct our examples. The third section is devoted to the 
explicit construction, and in the fourth section we discuss practical consequences of our examples. 

2 Graphical Models, Gröbner Bases, and Graver Bases

A hierarchical model is given by a collection of subsets $\Delta$ of the $n$-element set
$[n] := \{1, 2, \ldots, n\}$ together with an integer vector $d = (d_1, \ldots, d_n)$. Without loss
of generality, we can take $\Delta$ to be a simplicial complex. In the setting of probabilistic
inference, a hierarchical model is intended to encode interactions between a collection of $n$
discrete random variables: the number of states of the $i$-th random variable is $d_i$ and there is
an interaction factor between the set of random variables indexed by each $F \in \Delta$ (see, for
example, [6] for an introduction). From the standpoint of data security, $n$ is the number of
dimensions of a multi-way contingency table, the $d_i$ represent the number of levels in each
dimension, and the elements $F \in \Delta$ are the particular margins that are released. For the
rest of this paper $d_i = 2$ for all $i$; that is, we are considering dichotomous tables or binary
random variables.

Computing the $\Delta$-margins of a multi-way table is a linear transformation. We denote by
$A_\Delta$ the matrix in the standard basis that computes these margins. Finding the minimum value
for a cell entry given the $\Delta$-margins $b$ amounts to solving the following integer program,
which we denote $IP_\Delta$:

\[ \min u_{00\cdots0} \quad \text{subject to} \quad A_\Delta u = b, \; u \geq 0, \; u \text{ integral}. \]

The linear programming relaxation drops the integrality condition. We denote it by $LP_\Delta$:

\[ \min u_{00\cdots0} \quad \text{subject to} \quad A_\Delta u = b, \; u \geq 0. \]
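
To make the margin map concrete, here is a minimal Python sketch of how the matrix of margins could be assembled for binary tables. The dictionary representation of a table, the 0-based variable indices, and the names `margin` and `margin_matrix` are assumptions made for this illustration, and the three-variable model at the end is a hypothetical example rather than one of the models studied in this paper.

```python
from itertools import product

def margin(table, face):
    """Sum a sparse table (dict: cell tuple -> value) over the variables not in `face`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[j] for j in face)
        out[key] = out.get(key, 0) + value
    return out

def margin_matrix(n, facets):
    """Build the 0/1 margin matrix: one row per (facet, facet cell), one column per cell of {0,1}^n."""
    cells = list(product((0, 1), repeat=n))
    rows = []
    for face in facets:
        for key in product((0, 1), repeat=len(face)):
            rows.append([1 if tuple(c[j] for j in face) == key else 0 for c in cells])
    return cells, rows

# Hypothetical example: 3 binary variables, releasing the {0,1} and {1,2} margins.
facets = [(0, 1), (1, 2)]
cells, A = margin_matrix(3, facets)

# A small table u and its vector of margins b = A u.
u = {c: 0 for c in cells}
u[(0, 0, 0)] = 2
u[(1, 1, 0)] = 1
b = [sum(a * u[c] for a, c in zip(row, cells)) for row in A]
```

Only the maximal faces need to be listed, since the margins for smaller faces are sums of these rows.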

The integer programming gap $\mathrm{gap}_-(\Delta)$ is the largest difference between the optimal
values of $IP_\Delta$ and $LP_\Delta$ over all feasible marginals $b \geq 0$. Explicitly computing
the integer programming gap is a difficult problem, even for quite small models $\Delta$. However,
using properties of Gröbner bases, it is easy to give lower bounds on this gap. Recall the
definition of a Gröbner basis:

Definition 2. A reduced Gröbner basis $G_c$ of $A_\Delta$ with respect to the cost vector $c$ is a
minimal set of improving vectors that solves the integer program $IP_{\Delta,c}$ for any feasible
right hand side $b$.

In the literature of discrete optimization, Gröbner bases are often called test sets. A lower
bound on $\mathrm{gap}_-(\Delta)$ is given by inspecting the coordinates of the Gröbner basis with
respect to the cost vector $c = e_{00\cdots0}$.

Theorem 3 ([5], Corollary 4.3). The value $\mathrm{gap}_-(\Delta)$ is greater than or equal to one
less than the largest coordinate $g_{00\cdots0}$ of any element $g$ in the reduced Gröbner basis
$G_c$ of $A_\Delta$.
The precise definition of Gröbner bases can be found in [7]; however, we will restrict to a special
family of models where the Gröbner basis elements we need have a simpler description. For this,
we will need to recall the definition of the Graver basis. Note that any integer vector $u$ can be
written uniquely as $u = u^+ - u^-$, where $u^+$ and $u^-$ are nonnegative with disjoint support.

Definition 4. A nonzero integer vector $u \in \ker(A_\Delta)$ is called primitive if there does not
exist an integer vector $v \in \ker(A_\Delta) \setminus \{0, u\}$ such that $v^+ \leq u^+$ and
$v^- \leq u^-$. The set of vectors $\{u \in \ker(A_\Delta) \mid u \text{ is primitive}\}$ is called
the Graver basis of $A_\Delta$.

Given a simplicial complex $\Gamma$ on $[n-1]$ there is a natural construction of a new simplicial
complex $\Delta = \mathrm{logit}(\Gamma)$ on $[n]$ which corresponds to taking the logit model with
a binary response variable. The new model is defined as

\[ \mathrm{logit}(\Gamma) := \{ S \cup \{n\} \mid S \in \Gamma \} \cup 2^{[n-1]} \]

where $2^{[n-1]}$ is the set of all subsets of $[n-1]$. Note that $\ker(A_\Gamma)$ and
$\ker(A_{\mathrm{logit}(\Gamma)})$ are isomorphic, and there is a natural identification:
$u \in \ker(A_\Gamma)$ if and only if $(u, -u) \in \ker(A_{\mathrm{logit}(\Gamma)})$. This follows
by inspecting the condition required by the margin associated to the facet $[n-1]$ of
$\mathrm{logit}(\Gamma)$. A fundamental fact about logit models is that their Gröbner bases are easy
to describe in terms of the Graver basis of $A_\Gamma$, namely:

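Theorem 5 ([7], Theorem 7.1). Let $\Gamma$ be a model and $\Delta = \mathrm{logit}(\Gamma)$. Then:

1. $\mathrm{Gr}(A_\Delta) = \{ (u, -u) \mid u \in \mathrm{Gr}(A_\Gamma) \}$,

2. $\{ g \in \mathrm{Gr}(A_\Delta) \mid c \cdot g > 0 \} \subseteq G_c$.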

Note that Theorem 5 is only true when the response variable is binary. We now have all the
tools in hand to construct our example.
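
The logit construction and the identification of kernels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tables are dictionaries keyed by 0/1 tuples, variables are numbered from 0 with the response variable last, and the names `logit`, `lift`, and `in_kernel` are ours. The model `gamma` used below (independence of two binary variables) is a hypothetical small example.

```python
def margin(table, face):
    """Sum a sparse table (dict: cell tuple -> value) over the variables not in `face`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[j] for j in face)
        out[key] = out.get(key, 0) + value
    return out

def in_kernel(table, facets):
    """True if every margin of `table` over the given faces vanishes."""
    return all(v == 0 for face in facets for v in margin(table, face).values())

def logit(gamma_facets, n):
    """Generating faces of logit(Gamma); Gamma lives on variables 0..n-2, the response is n-1."""
    faces = [tuple(sorted(set(S) | {n - 1})) for S in gamma_facets]
    faces.append(tuple(range(n - 1)))  # the full simplex on the first n-1 variables
    return faces

def lift(u):
    """Map a table u on n-1 variables to the table (u, -u) on n variables."""
    lifted = {}
    for cell, value in u.items():
        lifted[cell + (0,)] = value
        lifted[cell + (1,)] = -value
    return lifted

# Hypothetical example: Gamma is the independence model on two binary variables.
gamma = [(0,), (1,)]
delta = logit(gamma, 3)

u = {(0, 0): 1, (1, 1): 1, (0, 1): -1, (1, 0): -1}  # in ker(A_Gamma)
w = {(0, 0): 1, (0, 1): -1}                          # not in ker(A_Gamma)
```

Here `lift(u)` has vanishing margins over every face in `delta`, while `lift(w)` does not, in line with the identification described above.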

3 The Construction 

Our main result is the following: 

Theorem 6. For each $n \geq 3$, there is a hierarchical model $\Delta_n$ on $n$ binary random
variables such that

\[ \mathrm{gap}_-(\Delta_n) \geq 2^{n-3} - 1. \]

A similar statement about exponential growth of the gap for upper bounds can be derived by
an analogous argument.

Proof. Our strategy will be to construct a hierarchical model $\Delta_n$ which has Gröbner basis
elements whose $00\cdots0$ entry is large. This will force the large gap by Theorem 3.
Let $\Gamma_n$ be the hierarchical model on $n-1$ random variables

\[ \Gamma_n = \{ S \mid S \subset [n-2], \; S \neq [n-2] \} \cup \{ \{n-1\} \}. \]

That is, $\Gamma_n$ is the union of the boundary of an $(n-3)$-simplex together with an isolated
point. Take $\Delta_n = \mathrm{logit}(\Gamma_n)$. To show the theorem with respect to $\Delta_n$
it suffices to show that $A_{\Gamma_n}$ has elements in its Graver basis that have large entries in
their $00\cdots0$ coordinate, by Theorem 5.
Consider the vector

\[ f_n = 2^{n-3} e_{(0,0)} + \sum_{i \neq 0, \; \sum_j i_j \text{ even}} e_{(i,1)} - (2^{n-3} - 1) e_{(0,1)} - \sum_{\sum_j i_j \text{ odd}} e_{(i,0)}. \]

Here $e_{(i,k)}$ denotes the standard unit vector whose index is $(i,k) \in \{0,1\}^{n-1}$; that is,
$e_{(i,k)}$ is the integral table whose only nonzero entry is a one in the $(i,k)$ position. Note
that $i \in \{0,1\}^{n-2}$ is an index on the first $n-2$ random variables.

We will now show that $f_n$ is a primitive vector in $\ker(A_{\Gamma_n})$. First we must show that
$f_n \in \ker(A_{\Gamma_n})$; that is, the positive and the negative part of $f_n$ have the same
margins with respect to $\Gamma_n$. For every maximal proper subset $S \subset [n-2]$, that is,
every $S$ with $|S| = n-3$, the margin of both parts is the same: namely, it is the vector $m_n$
given by

\[ m_n = (2^{n-3} - 1) e_0 + \sum_{i \in \{0,1\}^{n-3}} e_i. \]

Margins with respect to smaller subsets are sums of these and therefore agree as well. The margin
of both parts with respect to $\{n-1\}$ is the vector given by

\[ m'_n = 2^{n-3} e_0 + (2^{n-3} - 1) e_1. \]

In particular, the positive and negative parts of $f_n$ have the same margins and so $f_n$ belongs
to $\ker(A_{\Gamma_n})$.
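
The membership of $f_n$ in the kernel can also be checked mechanically for small $n$. The following Python sketch does so; the dictionary encoding of tables, the 0-based variable numbering, and the helper names `gamma_facets`, `f_vector`, and `in_kernel` are assumptions of this illustration, not notation from the references.

```python
from itertools import combinations, product

def margin(table, face):
    """Sum a sparse table (dict: cell tuple -> value) over the variables not in `face`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[j] for j in face)
        out[key] = out.get(key, 0) + value
    return out

def gamma_facets(n):
    """Generating faces of Gamma_n: all (n-3)-subsets of the first n-2 variables, plus {n-1}."""
    return list(combinations(range(n - 2), n - 3)) + [(n - 2,)]

def f_vector(n):
    """f_n as a sparse table on n-1 binary variables; a cell is (i_1, ..., i_{n-2}, k)."""
    zero = (0,) * (n - 2)
    f = {zero + (0,): 2 ** (n - 3), zero + (1,): -(2 ** (n - 3) - 1)}
    for i in product((0, 1), repeat=n - 2):
        if sum(i) % 2 == 1:
            f[i + (0,)] = -1   # -e_(i,0) when the coordinate sum of i is odd
        elif i != zero:
            f[i + (1,)] = 1    # +e_(i,1) when i is nonzero with even coordinate sum
    return f

def in_kernel(table, facets):
    """True if every margin of `table` over the given faces vanishes."""
    return all(v == 0 for face in facets for v in margin(table, face).values())

def positive_part(table):
    return {cell: v for cell, v in table.items() if v > 0}

# Check kernel membership for n = 3, ..., 9.
checks = {n: in_kernel(f_vector(n), gamma_facets(n)) for n in range(3, 10)}

# For n = 5, the margins of the positive part reproduce m_5 and m'_5.
m5 = margin(positive_part(f_vector(5)), (0, 1))
m5_prime = margin(positive_part(f_vector(5)), (3,))
```

For $n = 5$ the computed margins agree with the formulas above: `m5` has the value 4 at the zero index and 1 elsewhere, and `m5_prime` has entries 4 and 3.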

Now we must show that $f_n$ is a primitive vector in $\ker(A_{\Gamma_n})$. Suppose to the contrary
that there was some nontrivial $g_n \in \ker(A_{\Gamma_n})$ such that $g_n^+ \leq f_n^+$ and
$g_n^- \leq f_n^-$. Suppose that one of the coordinates of $g_n^+$ was nonzero in a position
indexed by some $(i,1)$ with $i \neq 0$ and $\sum_j i_j$ even. Then this forces $g_n^+$ to have
nonzero entries in all the possible positions indexed by $(i,1)$ with $i \neq 0$ and $\sum_j i_j$
even if the margins with respect to the $S \subset [n-2]$ are to be the same in $g_n^+$ and
$g_n^-$. However, this implies that the margin of $g_n^+$ with respect to $\{n-1\}$ has an entry of
$2^{n-3} - 1$ in the 1 position. This forces $g_n = f_n$ if $g_n \in \ker(A_{\Gamma_n})$. On the
other hand, since $g_n \neq 0$, it must have some positive entry. However, its only positive entry
could not be in the $(0,0)$ position since this would force a negative entry in some position
$(i,0)$. By the preceding argument, this implies that $g_n = f_n$ and thus $f_n$ is a primitive
vector.

□ 
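
For very small $n$ the primitivity of $f_n$ can be confirmed by exhaustive search over all integer vectors $g$ with $g^+ \leq f_n^+$ and $g^- \leq f_n^-$. The sketch below is a brute-force check, not the argument of the proof. The helper names are our own, and the search space grows so quickly that it is only practical for roughly $n \leq 5$.

```python
from itertools import combinations, product

def margin(table, face):
    """Sum a sparse table (dict: cell tuple -> value) over the variables not in `face`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[j] for j in face)
        out[key] = out.get(key, 0) + value
    return out

def in_kernel(table, facets):
    return all(v == 0 for face in facets for v in margin(table, face).values())

def gamma_facets(n):
    """Generating faces of Gamma_n: all (n-3)-subsets of the first n-2 variables, plus {n-1}."""
    return list(combinations(range(n - 2), n - 3)) + [(n - 2,)]

def f_vector(n):
    """f_n as a sparse table on n-1 binary variables."""
    zero = (0,) * (n - 2)
    f = {zero + (0,): 2 ** (n - 3), zero + (1,): -(2 ** (n - 3) - 1)}
    for i in product((0, 1), repeat=n - 2):
        if sum(i) % 2 == 1:
            f[i + (0,)] = -1
        elif i != zero:
            f[i + (1,)] = 1
    return f

def is_primitive(f, facets):
    """Brute-force test of Definition 4 for a kernel vector f."""
    if not in_kernel(f, facets):
        return False
    cells = [c for c, v in f.items() if v != 0]
    target = tuple(f[c] for c in cells)
    # g^+ <= f^+ and g^- <= f^- means each entry of g lies between 0 and the entry of f.
    ranges = [range(0, f[c] + 1) if f[c] > 0 else range(f[c], 1) for c in cells]
    for values in product(*ranges):
        if any(values) and values != target and in_kernel(dict(zip(cells, values)), facets):
            return False  # found a smaller nonzero kernel vector conformal to f
    return True

results = {n: is_primitive(f_vector(n), gamma_facets(n)) for n in (3, 4, 5)}

# Sanity check of the test itself: 2 * f_4 is in the kernel but is not primitive.
doubled = {cell: 2 * v for cell, v in f_vector(4).items()}
doubled_is_primitive = is_primitive(doubled, gamma_facets(4))
```

The last two lines guard against a vacuous test: the doubled vector is correctly rejected, because $f_4$ itself lies in the search box.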

To explicitly construct an example of a set of margins $b$ with respect to $\Delta_n$ where the gap
between the LP and IP optima is $2^{n-3} - 1$, just take

\[ u = (2^{n-3} - 1) e_{(0,0,0)} + \sum_{i \neq 0, \; \sum_j i_j \text{ even}} e_{(i,1,0)} + (2^{n-3} - 1) e_{(0,1,1)} + \sum_{\sum_j i_j \text{ odd}} e_{(i,0,1)} \]

and $b = A_{\Delta_n} u$. It follows that $u$ cannot be improved to a nonnegative integer table with
smaller $(0,0,0)$ coordinate by appealing to the Gröbner basis. However, the nonnegative rational
vector

\[ v = u - \frac{2^{n-3} - 1}{2^{n-3}} (f_n, -f_n) \]

has the same margins $b$ as $u$ but has $(0,0,0)$ coordinate 0.
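
The example can be checked numerically for small $n$. The Python sketch below builds $u$ and $v$ with exact rational arithmetic, confirms that $v$ is nonnegative with the same margins and a zero in the all-zero cell, and finds the integer optimum by exhaustive enumeration. The function names and the enumeration strategy are assumptions of this illustration, and the exhaustive search is only practical for about $n \leq 5$.

```python
from fractions import Fraction
from itertools import combinations, product

def margin(table, face):
    """Sum a sparse table (dict: cell tuple -> value) over the variables not in `face`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[j] for j in face)
        out[key] = out.get(key, 0) + value
    return out

def same_margins(a, b, facets):
    for face in facets:
        ma, mb = margin(a, face), margin(b, face)
        if any(ma.get(k, 0) != mb.get(k, 0) for k in set(ma) | set(mb)):
            return False
    return True

def delta_facets(n):
    """Generating faces of Delta_n = logit(Gamma_n), variables numbered 0..n-1, response last."""
    faces = [S + (n - 1,) for S in combinations(range(n - 2), n - 3)]
    return faces + [(n - 2, n - 1), tuple(range(n - 1))]

def example(n):
    """Return the integer table u and the rational table v = u - t * (f_n, -f_n)."""
    zero, big = (0,) * (n - 2), 2 ** (n - 3)
    f = {zero + (0,): big, zero + (1,): -(big - 1)}
    u = {zero + (0, 0): big - 1, zero + (1, 1): big - 1}
    for i in product((0, 1), repeat=n - 2):
        if sum(i) % 2 == 1:
            f[i + (0,)] = -1
            u[i + (0, 1)] = 1
        elif i != zero:
            f[i + (1,)] = 1
            u[i + (1, 0)] = 1
    lifted = {}
    for cell, value in f.items():          # the kernel vector (f_n, -f_n)
        lifted[cell + (0,)] = value
        lifted[cell + (1,)] = -value
    t = Fraction(big - 1, big)
    v = {c: u.get(c, 0) - t * lifted[c] for c in lifted}
    return u, v

def ip_minimum(u, n, facets):
    """Exhaustive minimum of the all-zero cell over nonnegative integer tables with u's margins."""
    top = margin(u, tuple(range(n - 1)))   # this margin fixes w(c, 0) + w(c, 1) for each c
    cells = [c for c, m in top.items() if m > 0]
    best = None
    for xs in product(*(range(top[c] + 1) for c in cells)):
        w = {}
        for c, x in zip(cells, xs):
            w[c + (0,)] = x
            w[c + (1,)] = top[c] - x
        if same_margins(w, u, facets):
            value = w.get((0,) * n, 0)
            best = value if best is None else min(best, value)
    return best

gaps = {}
for n in (4, 5):
    u, v = example(n)
    facets = delta_facets(n)
    lp_witness_ok = same_margins(u, v, facets) and min(v.values()) >= 0 and v[(0,) * n] == 0
    gaps[n] = (lp_witness_ok, ip_minimum(u, n, facets))
```

Since the objective is nonnegative, the witness $v$ shows that the LP optimum is 0, so the second entry of each pair in `gaps` is the gap itself.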

4 Discussion 

In this paper, we constructed an example to show that the gap between the linear programming 
lower bounds and the integer programming lower bounds for a cell entry can be exponentially large 
in the number of binary random variables of a hierarchical model. Previous explicit constructions 
of this type [3] gave gaps that were linear in the number of random variables. 

There are a number of possible modifications to our result which can be made to produce
examples of different flavors. For instance, small modifications of our argument can be used to
produce exponential gaps between the linear programming and integer programming upper bounds
for cell entries. Furthermore, by adding extra dimensions by subdividing $\Delta$, and using some
of the techniques in [3], one can produce instances of purely graphical models with these
exponential growth properties.

While it is not clear how often, given a random collection of margins $b$, one should expect
to encounter the exponentially large gaps we have demonstrated, we expect that for problems on
large sparse tables, large gaps between the LP and IP solutions will not be exceptional. This
expectation is based on the observation that if any gap value can occur, then so can all the
integer values smaller than this gap. This suggests that research needs to be done to determine
better heuristics for approximating bounds on cell entries in large sparse tables.

References 

[1] L. Buzzigoli and A. Giusti. An algorithm to calculate the lower and upper bounds of the
elements of an array given its marginals. In Statistical Data Protection Proceedings, Eurostat,
Luxembourg (1999), pp. 131-147.

[2] S.D. Chowdhury, G.T. Duncan, R. Krishnan, S.F. Roehrig and S. Mukherjee. Disclosure
detection in multivariate categorical databases: auditing confidentiality protection through
two new matrix operators. Management Science 45 (1999), no. 12, 1710-1723.

[3] L. Cox and J. George. Controlled rounding for tables with subtotals. Annals of Operations 
Research 20 (1989) 141-157. 

[4] M. Develin and S. Sullivant. Markov bases of binary graph models. Annals of Combinatorics
7 (2003), pp. 441-466.

[5] S. Hoşten and B. Sturmfels. Computing the integer programming gap. To appear in
Combinatorica, 2003.

[6] S. Lauritzen. Graphical Models. Oxford University Press, New York, 1996. 






[7] B. Sturmfels. Gröbner Bases and Convex Polytopes. American Mathematical Society,
Providence, RI, 1995.






